User Manual: TextHarvest

Table of Contents

Introduction
First-Time User Questions
General Overview
A Simple Example
Another Example
TextHarvest Basics
Word Lists
Specifying a List of Words
Combining Keep and Delete
The Controls Input Box
Text Case
Null Lines
File Name Annotation
Regular Expressions
Other Controls
Advanced Techniques
Shortcut Keys
Wildcards
Multiple Wildcards
Processing the Windows Clipboard
Boolean Filtering (Anding)
Command Line Parameters
Batch File Considerations
The Error Reporting File
The Log File
Usage Notes
Matching Problems
False Matches
Finding Slashes
File Formats
Regular Expression Syntax
Overview
Basic Regular Expressions
Using the Asterisk
Advanced Regular Expressions
Scripting
A Simple Example
Scripting User Manual
Sample Scripts
Uninstalling TextHarvest
Custom Conversion
Legal Notices
About TextHarvest

Additional documentation (such as version history) can be found in the ReadMe.txt file.

Introduction

First-Time User Questions

What does TextHarvest do?
It reads files and copies the information you want, altering it in various ways you specify.

What kinds of files can it process?
Text (Windows, DOS, Unix, Macintosh), fixed-record-length, and character-terminated.

How much does TextHarvest cost?
For most people: nothing. Certain exceptionally powerful capabilities require that you purchase a special license, but the average user will not need these.

Can I give copies of TextHarvest to other people?
Yes.

Can I sell copies of TextHarvest to other people?
No. You can charge a small distribution fee, though.

I'm a programmer, so why would I need TextHarvest?
Many operations that would take 100 lines of code in a traditional programming language can be performed with a single script command.

General Overview

In its simplest form, TextHarvest is a utility that copies a text file. As it does so, it can:

• Retain lines that contain specific text
• Skip lines that contain specific text

Thus, you can use TextHarvest to filter a text file, preserving only those lines that interest you.

You can also use powerful scripts to:

• Process files other than plain text
• Modify the data sent to the output file
• Perform further filtering and analysis

Here are some typical scripted operations:

• Change one word to another one
• Change uppercase to lowercase
• Rearrange columns of text
• Look up values in a table
• Convert data to CSV (Comma Separated Value) format

But for now, let's start with the basics ...

A Simple Example

TextHarvest comes with a demonstration file, named ThingsToDo.txt, which contains a simple "To Do" list. You can view this file by entering the name in the "Input File" box then clicking the corresponding View button.

The first column contains a category, such as "Car", "Home" or "Work", while the second column describes the task to be done. Let us say you only wanted to see the lines that contained the word "Work". Here is the way to do this:

• Specify the input file name (ThingsToDo.txt)
• Specify an output file name (Output.txt)
• Make sure Autoview is checked and Append is not checked
• Make sure the "Script file" input box contains the word "None" (no quotes)
• Put the word "Work" in the "/Keep list" like this: /Work
• Click the Start button (shortcut key: F9)

TextHarvest will then read ThingsToDo.txt and copy only those lines that contain the word "Work" (or variations such as "WORK" or "work"). Then, because you checked "Autoview", the output file (Output.txt) will be displayed.

Note: The reason we put a slash ("/") character in front of the word "Work" is explained later, in the section "Specifying a List of Words".

Another Example

Now let us suppose that you want to do the opposite of what you did in the previous example: you want to see every line except those that contain the word "Work". Remove the word "Work" from the "/Keep list" input box and put it in the "/Delete list" box like this:

/Work

When you click the Start button, TextHarvest will copy the file (ThingsToDo.txt) to the output file but will remove any lines that contain the word "Work".

TextHarvest Basics

Word Lists

Specifying a List of Words

Once again using ThingsToDo.txt, let us copy only those lines that contain the word "Home", or "Work", or both. Make sure that the "/Delete list" input box is empty, then enter the following in the "/Keep list" input box:

/home/work

When you click the Start button, the file will be copied, but only those lines that contain "Home" or "Work" (with variations, such as "HOME", "Work" and so on).

Combining Keep and Delete

You can specify both Keep and Delete lists. For example, let us say you used the following criteria:

/Keep list: /work/home
/Delete list: /inventory

This would copy any lines with the words "work" or "home", but which do not contain the word "inventory".

When Keep and Delete lists are both specified, a line is first checked to see if it passes the "Keep" test. If so, it is then compared to the "Delete" list. If a match is found, the line is not copied.
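The Keep-then-Delete order described above can be sketched in a few lines of Python. This is only an illustration of the logic, not TextHarvest's own code: the function name and the sample lines are hypothetical, and the sketch assumes that an empty Keep list lets every line through.

```python
def line_is_copied(line, keep_words, delete_words):
    """Apply the Keep test first, then the Delete test, ignoring case."""
    lowered = line.lower()
    # Assumption: an empty Keep list means every line passes the Keep test.
    if keep_words and not any(w.lower() in lowered for w in keep_words):
        return False
    # A line that passed the Keep test is dropped if any Delete word matches.
    return not any(w.lower() in lowered for w in delete_words)


# Hypothetical sample lines in the style of ThingsToDo.txt
print(line_is_copied("WORK Call the office", ["work", "home"], ["inventory"]))
print(line_is_copied("WORK Do the inventory count", ["work", "home"], ["inventory"]))
```

The first call keeps the line; the second rejects it because "inventory" appears in the Delete list.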

The Controls Input Box

Text Case

By default, TextHarvest will ignore text case when looking at the "/Keep list" and the "/Delete list". You can override this behaviour, though, using the "/Controls" input box. Here are the various settings:

/KI = Ignore case on Keep (default)
/KM = Match case on Keep
/DI = Ignore case on Delete (default)
/DM = Match case on Delete

Try using the sample input file ThingsToDo.txt to test this out:

• Make sure the "Script file" input box contains the word "None" (no quotes)
• Set your "/Keep list" to "/CAR/work"
• Make sure your "/Delete list" is empty
• Set your "/Controls" input box to "/KM"
• Click the Start button

The output will contain references to "CAR", but will ignore the lines that start with "WORK" because "WORK" (which is in uppercase) does not match "work" (which is in lowercase).

Null Lines

By default, TextHarvest ignores all null (zero-length) lines in the input file. However, you can set the "/Controls" input box to deal with this. Here are the settings:

/NI = Ignore null lines (default)
/NK = Keep null lines
/NS = Keep null lines, but never output more than two in a row

Try using the sample input file ThingsToDo.txt to test this out:

• Make sure the "Script file" input box contains the word "None" (no quotes)
• Clear the "/Keep list" input box
• Set "/Delete list" to "/car/work"
• Set "/Controls" to "/NK"
• Click the Start button

The output will not contain any lines containing "car" or "work", but it will contain any null lines found in the input file. Try the experiment again, first with "/NS" and then with "/NI". (Since "/NI" is the default, you could also simply leave the "/Controls" input box blank.)
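To make the three null-line settings concrete, here is a rough Python sketch of what each one does to a list of lines. The function name is hypothetical, and the sketch makes one assumption the manual does not spell out: a run of null lines is counted only across consecutive null lines, and any non-null line resets the count.

```python
def filter_null_lines(lines, mode="NI"):
    """Handle null (zero-length) lines according to NI, NK or NS."""
    output = []
    run = 0  # consecutive null lines already written (used by NS only)
    for line in lines:
        if line != "":
            run = 0
            output.append(line)
        elif mode == "NK":
            output.append(line)
        elif mode == "NS" and run < 2:
            run += 1
            output.append(line)
        # Otherwise (NI, or a third null line under NS) the line is skipped.
    return output


sample = ["a", "", "", "", "b"]
print(filter_null_lines(sample, "NI"))
print(filter_null_lines(sample, "NK"))
print(filter_null_lines(sample, "NS"))
```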

File Name Annotation

If you are processing multiple files using wildcards, you may wish to know which output lines came from which files. TextHarvest can annotate the output such that the file name precedes lines extracted from a particular file:

/FN = No, do not output the file name (default)
/FY = Yes, output the file name
/FS = Yes, output the file name, and put separator lines above and below

The separator line (control /FS) makes it easier to spot the file names in a long output file.

Only the file names of files that actually generate output lines are included. If a file does not generate any output lines, its name is not mentioned.

File name annotation lets you use TextHarvest as a "Find Text" utility. For example, if you wanted to search a folder for the word "inventory", you could do this:

• Set the "Input file" box to the wildcard "Things*.txt" (without the quotes)
• Make sure the "Script file" input box contains the word "None"
• Set the "/Keep list" input box to "/inventory"
• Make sure your "/Delete list" is empty
• Set the "/Controls" input box to "/FS" or "/FY"
• Click the Start button

The example given above would search all files matching the wildcard pattern Things*.txt for the word "inventory".

Regular Expressions

By default, TextHarvest will search for the precise text fragments you specify in the /Keep and /Delete lists. However, you can enable "regular expressions", which let you match patterns rather than specific sequences of characters:

/KR = Enable regular expressions for the /Keep list
/DR = Enable regular expressions for the /Delete list

Consider the following /Keep list:

/D.g/C[aou]t

With /KR specified in the "/Controls" input box, this would match any line that contained "Dog", "Cat", "Cot" or "Cut". It would also match lines containing "Dig" and "D3g", so when you are using regular expressions you must ensure that you are indicating precisely what you want.

If you have never used regular expressions before, you may find them a bit confusing at first, but with a bit of practice you will come to appreciate just how much power they put at your fingertips. Please see "Regular Expression Syntax" for additional examples of regular expressions.
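If you want to experiment with these two patterns outside TextHarvest, the following Python sketch behaves the same way for these simple constructs. Python's re module is a stand-in here, not TextHarvest's own matching engine, and the sketch assumes case-sensitive matching; the function name and sample lines are hypothetical.

```python
import re

# The two patterns from the /Keep list /D.g/C[aou]t
keep_patterns = ["D.g", "C[aou]t"]


def keeps(line):
    """Return True if any Keep pattern is found anywhere in the line."""
    return any(re.search(pattern, line) for pattern in keep_patterns)


for sample in ["Walk the Dog", "Part number D3g", "Buy a Cot", "Cet is not a word"]:
    print(sample, "->", keeps(sample))
```

Only the last sample fails, because "Cet" is not in the set [aou] and the line contains nothing of the form "D", any character, "g".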

Other Controls

Autoview, if checked, displays the output file after processing (if there is anything to display). If it is not checked, you have to click the View button to see the output. Append, if checked, places the output at the end of the specified output file. If it is not checked, the original copy of the output file (if it exists) is renamed with a .BAK extension and a new version is created.

Advanced Techniques

Shortcut Keys

Wildcards

Multiple Wildcards

You can specify multiple wildcards by using semicolons, as in this example:

*.txt;*.me

This would process input files with the .txt extension (example: xyz.txt) and the .me extension (example: read.me).

There is no limit to the number of wildcards you specify, but bear in mind that TextHarvest lets you process the same file more than once. Consider this example:

*.txt;my*.txt

This would process all files with a .txt extension, then all files with a .txt extension where the file name starts with "my". Thus, a file named "myfile.txt" would be processed twice.

You cannot specify multiple file names for the output file. All output goes to a single output file.
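The double-processing effect is easy to see in a small Python sketch. It uses the standard fnmatch module as a stand-in for Windows wildcard matching (an approximation, compared case-insensitively here), and the function name and file list are hypothetical.

```python
from fnmatch import fnmatchcase


def files_to_process(spec, available):
    """Expand a semicolon-separated wildcard list, one wildcard at a time."""
    processed = []
    for pattern in spec.split(";"):
        # Each wildcard is expanded independently, so a file matching
        # two wildcards appears (and would be processed) twice.
        processed.extend(
            name for name in available
            if fnmatchcase(name.lower(), pattern.lower())
        )
    return processed


folder = ["myfile.txt", "xyz.txt", "read.me"]
print(files_to_process("*.txt;my*.txt", folder))
# -> ['myfile.txt', 'xyz.txt', 'myfile.txt']
```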

Processing the Windows Clipboard

TextHarvest can read and write to the Windows text clipboard as if it were a regular text file. To read from the clipboard, enter CLIPBOARD in the "Input File" box. To write to the clipboard, enter CLIPBOARD in the "Output File" box. It is possible to do both at once. Of course, after processing, the original contents of the clipboard will have been overwritten.

Tip: Most Windows programs let you copy selected text with Ctrl-C and paste with Ctrl-V.

Boolean Filtering (Anding)

Note: You can use the sample file ThingsToDo.txt to try out the examples given below. The examples should be entered in your /Keep list. Make sure that the "/Delete" and "/Controls" input boxes are empty, and that the "Script file" input box is set to "None".

The lists of words (see "Word Lists") you enter in the "/Keep list" and "/Delete list" input boxes are typically a sequence of alternatives. For example, if your /Keep list is "/Cat/Dog/Cow" it means you want to keep lines that contain "Cat" or "Dog" or "Cow". This is called an "OR-list".

However, sometimes you want to keep lines that contain all of the words you listed. That is to say, if even one of the words is missing, you don't want to keep the line. For this you need an "AND-list". TextHarvest's AND function is represented by two ampersands. Here is an example of ANDing:

/Cat&&/Dog&&/Cow

This will match any line that contains all three (Cat, Dog and Cow).

You can combine ANDing and ORing, as in this example:

/Cat/Dog/Cow&&/Moose

This will match any line that contains any one of the first three items (Cat or Dog or Cow) AND also contains the word Moose.

Now consider this example:

/Cat/Dog/Cow&&/Moose/Antelope

This will match any line that contains one of the first three items (Cat or Dog or Cow) AND also contains one of the next two items (Moose or Antelope).

If any of the AND conditions is not met, the line does not match. For example, consider this list:

/North/South&&/Up/Down&&/Back/Forth

A line that contains North, Up and Back would match. A line that contains South, Down and Back would match. But a line that is missing both North and South would not match.
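The combination of ANDing and ORing can be summarized as "every group separated by && must contribute at least one match". Here is a minimal Python sketch of that rule; the function name is hypothetical, the sample lines are invented, and the matching is assumed to ignore case (the default).

```python
def matches_and_list(line, spec):
    """Return True if every &&-separated OR-list has at least one word in the line."""
    lowered = line.lower()
    for group in spec.split("&&"):
        if not group:
            continue  # ignore an empty group rather than fail on it
        delimiter = group[0]  # the first character of each list is its delimiter
        words = [w for w in group[1:].split(delimiter) if w]
        if not any(w.lower() in lowered for w in words):
            return False  # one AND condition failed, so the line does not match
    return True


spec = "/North/South&&/Up/Down&&/Back/Forth"
print(matches_and_list("North Up Back", spec))    # True
print(matches_and_list("South Down Back", spec))  # True
print(matches_and_list("Up Back", spec))          # False: neither North nor South
```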

Command Line Parameters

To call TextHarvest from the command line (e.g. from a batch file or in a Windows shortcut), the following format is used:

TextHarvest /i"Input.txt" /o"Output.txt"

You can also specify the /Keep, /Delete and /Controls lists:

/X"/keep/list" /Y"/delete/list" /Z"/control/list"

To specify a script file, use /S as in this example:

/S"ScriptSample01.txt"

If you are not using a script, you should specify /S"None" to override whatever value TextHarvest had previously saved for that input box.

For a general overview of command line parameters, start up TextHarvest as follows:

TextHarvest /?

This displays a window which summarizes the command-line options. The window is also displayed if your command line contains an option that TextHarvest does not recognize.
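If you generate batch files from a program, a small helper like the following Python sketch can assemble the command line from the options described above. The helper and its parameter names are hypothetical; it only formats a string, and it assumes that the order of the options does not matter and that /S"None" should be emitted whenever no script is given.

```python
def build_command(input_file, output_file, keep="", delete="", controls="", script="None"):
    """Assemble a TextHarvest command line using the /i, /o, /X, /Y, /Z and /S options."""
    parts = ["TextHarvest", f'/i"{input_file}"', f'/o"{output_file}"']
    if keep:
        parts.append(f'/X"{keep}"')
    if delete:
        parts.append(f'/Y"{delete}"')
    if controls:
        parts.append(f'/Z"{controls}"')
    # Always pass /S so that a previously saved script name is overridden.
    parts.append(f'/S"{script}"')
    return " ".join(parts)


print(build_command("Input.txt", "Output.txt", keep="/work/home", delete="/inventory"))
```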

Batch File Considerations

The Error Reporting File

The Log File

In addition to the Error Reporting File, TextHarvest also creates a log file (named TextHarvest-Log.txt). TextHarvest uses the log file to record the date and time when processing started and ended. It also uses the log file to report anything that is slightly unusual but not a serious problem. You can view the Log File using the "Support Files" input box of the Parsing Parameters window; it will be listed in the drop-down list.

Usage Notes

Matching Problems

False Matches

Sometimes TextHarvest matches on strings of characters that you do not want matched. For example, if you set your /Keep list to /home/car while copying the sample file ThingsToDo.txt you will find that an additional line is included:

WORK Buy toner cartridge for laser printer

This was included because the characters "car" appear in the word "cartridge". You can get around this by explicitly indicating the space after "car":

/home/car /

An alternative solution in this particular case would be to set the /Keep list to "/HOME/CAR" and the /Controls setting to "/KM" (Keep: match case).
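The false match, and both workarounds, come down to plain substring tests. The short Python sketch below mimics them on the line quoted above; it illustrates the idea only, and the helper name is hypothetical.

```python
line = "WORK Buy toner cartridge for laser printer"


def contains(text, word, match_case=False):
    """Substring test, ignoring case unless match_case is True."""
    if match_case:
        return word in text
    return word.lower() in text.lower()


print(contains(line, "car"))                   # True: "car" is inside "cartridge"
print(contains(line, "car "))                  # False: no "car" followed by a space
print(contains(line, "CAR", match_case=True))  # False: "cartridge" is lowercase
```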

Finding Slashes

You will normally separate the words in your /Keep and /Delete lists with the slash ("/") character (e.g. "/home/work"). But what if you are looking for a slash? All you need to do is begin your word list with a different character, such as the "backslash" character ("\"). You can try processing the sample input file ThingsToDo.txt with the following "/Keep list" to see that this works as it should:

\home\work

In other words, the first character in the list becomes the delimiter which separates the words.
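The "first character is the delimiter" rule can be sketched in Python as follows. The function name is hypothetical, and the third example (a word that itself contains a slash) is an invented illustration of why you might choose a different delimiter.

```python
def parse_word_list(spec):
    """Split a word list on its first character, which acts as the delimiter."""
    if not spec:
        return []
    delimiter = spec[0]
    # Empty pieces (for example after a trailing delimiter) are discarded.
    return [word for word in spec[1:].split(delimiter) if word]


print(parse_word_list("/home/work"))       # ['home', 'work']
print(parse_word_list("\\home\\work"))     # ['home', 'work']
print(parse_word_list("\\and/or\\work"))   # ['and/or', 'work']
```

Note that the backslashes are doubled only because Python string literals require it; in the "/Keep list" input box you would type a single backslash.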

File Formats

If you do not use scripts, TextHarvest can read either Windows-style (CRLF-terminated) text files or Unix-style (LF-terminated) text files, and output is always a Windows style (CRLF-terminated) text file. If you do use scripts, TextHarvest can read all standard text files (including the Macintosh variety), fixed-record-length files, and character-terminated records, while output can be whatever you want it to be.

Regular Expression Syntax

Overview

TextHarvest supports most of the regular expression conventions. In the following list, the letters x, y and z stand in for any character.

^xxx     Matches a sequence of characters at the start of a line
xxx$     Matches a sequence of characters at the end of a line
x.x      The dot matches any single character (here, between two other characters)
[xz]     Matches a set of characters ("x" and "z" in this example)
[x-z]    Matches a range of characters (this example covers "x" to "z")
x*       Matches zero or more occurrences of the preceding character
[xyz]*   Matches zero or more occurrences from the preceding set
[x-z]*   Matches zero or more occurrences from the preceding range
[^xyz]   Matches any character but the ones specified
[^x-z]   Matches any character but the ones in the specified range

The backslash (\) character has a special meaning in regular expressions:

\x       Means "take the next character literally"
         For example: \[ means the actual [ character rather than the start of a set or range
\t       Means "a tab character" (ASCII character 9)

Basic Regular Expressions

Note: In the following examples, we assume that case sensitivity has been turned on, using the /KM or /DM setting in the "/Controls" input box.

Here are some examples of matches:

C.t       Matches Cat, Cot, Cut, Cxt, C3t etc.
C[aou]t   Matches Cat, Cot, Cut only
B..d      Matches Bird, Bred, Bead etc.
^Dog      Matches Dog only if it is at the beginning of a line
Moose$    Matches Moose only if it is at the end of a line
Pa*d      Matches Pd, Pad, Paad, Paaad etc.

Using the Asterisk

The last example given above uses the * character to indicate zero, one or more occurrences of a particular character (in this case, the letter "a"). Unlike the * wildcard character used in file names, it does not match "any" character but is specific. That is why "Pa*d" would not match "Parsed"; the asterisk means "match zero or more of the preceding character specification".

If you actually want to search for "Pa" followed by one or more letters and then "d", the correct syntax is:

Pa[a-z][a-z]*d

This means that we want to match "Pa", then a letter in the range from "a" to "z", then some number (including zero) of characters in the "a" to "z" range, and finally the letter "d". The character string "Parsed" would meet these criteria, as would "Paid" and "Packed". "Pad" would not, because there is no letter between "Pa" and the final "d".
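You can check the behaviour of the asterisk with a short Python sketch. Python's re module is used here as a stand-in for TextHarvest's matching (an assumption, though these basic constructs behave the same way in most regular expression engines), and the helper name is hypothetical.

```python
import re


def found(pattern, text):
    """Return True if the pattern matches anywhere in the text (case-sensitive)."""
    return re.search(pattern, text) is not None


# "Pa*d" means P, then zero or more a's, then d
for word in ["Pd", "Pad", "Paaad", "Parsed"]:
    print("Pa*d", word, found("Pa*d", word))

# "Pa[a-z][a-z]*d" means Pa, then one or more lowercase letters, then d
for word in ["Parsed", "Paid", "Packed", "Pad"]:
    print("Pa[a-z][a-z]*d", word, found("Pa[a-z][a-z]*d", word))
```

In the first loop only "Parsed" fails; in the second loop only "Pad" fails.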

Advanced Regular Expressions

Note: In the following examples, we assume that case sensitivity has been turned on, using the /KM or /DM setting in the "/Controls" input box.

Here are some more complicated examples of regular expressions:

C[^ou]t        Matches Cat, Cxt and so on, but not Cot or Cut
C[ao]*t        Matches Ct, Cat, Caat, Cot, Coot, Cooot, Coat, Coaoat etc.
[0-9][0-9]*    Matches numbers such as 0, 1, 01, 10, 25, 0990, 9999 etc.
-[0-9][0-9]*   Matches negative numbers such as -0, -1, -19, -12345 etc.

In the last example, [0-9] is specified twice to ensure that at least one digit is found. Bear in mind that the * character means "zero or more occurrences". If you had specified "-[0-9]*" you would get a match within the sequence "Hello - there", since the "-" character is indeed found, followed by zero occurrences of the digits 0 through 9.

You can create fairly complex patterns using regular expressions. Consider this example:

\$[0-9][0-9]*\.[0-9][0-9]

This would match dollar amounts with two decimal places, such as $0.00, $03.23, $3.14, $9.99, $1234.56 and so on.
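The difference between "-[0-9]*" and "-[0-9][0-9]*", and the dollar-amount pattern, can be tried out with the Python sketch below. As before, Python's re module is only a stand-in for TextHarvest's matching, and the constant and function names are hypothetical.

```python
import re

LOOSE_NEGATIVE = r"-[0-9]*"        # a minus sign followed by zero or more digits
NEGATIVE = r"-[0-9][0-9]*"         # a minus sign followed by at least one digit
DOLLARS = r"\$[0-9][0-9]*\.[0-9][0-9]"


def found(pattern, text):
    """Return True if the pattern matches anywhere in the text."""
    return re.search(pattern, text) is not None


print(found(LOOSE_NEGATIVE, "Hello - there"))  # True: the bare "-" is enough
print(found(NEGATIVE, "Hello - there"))        # False: no digit after the "-"
print(found(NEGATIVE, "Balance: -19"))         # True
print(found(DOLLARS, "Total $1234.56"))        # True
print(found(DOLLARS, "Only $5 left"))          # False: no decimal places
```

Because the pattern is searched for anywhere in the line, an amount such as $3.145 would also be reported as a match (on its first part, $3.14).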

Scripting

Parse-O-Matic Scripting lets you modify the results generated by TextHarvest. Scripting can examine the text lines that are retained after TextHarvest's /Keep and /Delete settings are taken into account. You could, for example:

• Replace one string of text with another one
• Convert some of the line to uppercase
• Eliminate certain lines on the basis of multiple criteria
• Rearrange the order of data items in a line
• Add up numbers and include totals at the end of the output

All this, and much, much more, is possible with Parse-O-Matic Scripting.

When using a script you will generally leave the /Keep and /Delete input boxes empty, since the script can do this kind of selection. The /Controls input box can be set to /NK to keep null lines or /NI (default) to ignore null lines.

A Simple Example

Here is a very simple example, using the sample ThingsToDo.txt file. Let us say you wanted to convert the "category" (CAT, CAR, HOME, WORK, LEISURE) to lowercase. To do this, you would use a text editor program to write a script file (let's call it ScrExperiment.txt) that looks like this:

    Category = $OutData[1 9]
    Description = $OutData[10 999]
    Category = ChangeCase Category 'Lowercase'
    OutEnd Category Description

The first two lines extract the two parts of the output data from the variable named $OutData, which contains the line of text from TextHarvest. The third line converts the category to lowercase, while the final line sends the modified line to the output file. (Whenever you run TextHarvest's results through a script, it is up to the script to actually send the lines to the output file.)

To run this script, you would enter its name (we called it ScrExperiment.txt) in the "Script File" input box of the Parsing Parameters window, then click the Start button.

If you do not want to run a script (i.e. you simply want to use TextHarvest as a basic filter), enter "None" (without the quotes) in the "Script File" input box.
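If you are more familiar with a general-purpose language, the following Python sketch shows roughly what the four script lines do. It is an interpretation rather than a definition of the scripting language: it assumes that $OutData[1 9] means columns 1 through 9 counted from 1 and inclusive, and that OutEnd writes its arguments one after the other with nothing in between. The function name and the sample line are hypothetical.

```python
def convert_line(out_data):
    """Lowercase the category (columns 1-9) and leave the description unchanged."""
    category = out_data[0:9]        # assumed equivalent of $OutData[1 9]
    description = out_data[9:999]   # assumed equivalent of $OutData[10 999]
    category = category.lower()     # ChangeCase Category 'Lowercase'
    return category + description   # OutEnd Category Description


# A hypothetical line with the category padded to nine columns
sample = "WORK".ljust(9) + "Buy toner cartridge for laser printer"
print(convert_line(sample))
```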

Scripting User Manual

A complete user manual for Parse-O-Matic Scripting is included with TextHarvest; see the "Parse-O-Matic Scripts" user manual.

Sample Scripts

Here is a list of the sample scripts included with TextHarvest:

Script File Name     Input File to Use   Adv   Comments
------------------   -----------------   ---   -------------------------
ScriptSample01.txt   ThingsToDo.txt      -
ScriptSample02.txt   ThingsToDo.txt      -
ScriptSample03.txt   InputSample01.txt   -
ScriptSample04.txt   ToDoListFixed.dat   -     Fixed-record-length input
ScriptSample05.txt   ToDoListDelim.dat   -     Character-delimited input
ScrSampleAdv01.txt   ThingsToDo.txt      Y
ScrSampleAdv02.txt   Scr*.txt            Y     Input file uses wildcard
ScrExercise.txt      ThingsToDo.txt      Y     Demonstrates all commands
------------------   -----------------   ---   -------------------------

Adv = Uses Advanced Scripting commands (see the Scripting user manual).

It is best to study these scripts in the order they are listed above. To view a script, click on the button with the folder icon next to the "Script File" input box. You can then select a script and view it by clicking the View button.

To try out the sample script ScriptSample01.txt:

• Set your "Input File" box to ThingsToDo.txt
• Set the "Output File" box to an appropriate file name (e.g. Output.txt)
• Make sure Autoview is checked and Append is not checked
• Clear your "/Keep list", "/Delete list" and "/Controls" input boxes
• Set the "Script File" input box to ScriptSample01.txt
• Click the Start button

Once the output file is displayed, you may find it helpful to also view the input file, so you can understand how the output data was transformed.

Uninstalling TextHarvest

If you should need to uninstall TextHarvest, start up the Windows Control Panel, then click on Add/Remove Programs. Find TextHarvest on the list, and proceed with removal.

Custom Conversion

TextHarvest is handy and simple to use, but it has its limitations. That is a perennial problem with utilities: there always seems to be one feature missing, one that you urgently need! We invite you to visit our web site if you need a custom conversion application. Our company has been doing data conversion since 1985.

Legal Notices

TextHarvest™ and Parse-O-Matic™ are trademarks of Pinnacle Software. This document is Copyright © 2003 by Pinnacle Software. You may not distribute copies of this document without explicit permission from Pinnacle Software, except in conjunction with the complete and unaltered TextHarvest installation package. Please write to us if you would like to adapt this product or any of our other products to your own distributed application. The entire product (comprising software, documentation and supporting provisions) is presented as-is; we make no claim about (and disavow liability for) its suitability, accuracy, reliability, performance etc. If you should encounter a problem with the product, please write to us to find out if a solution is available.
